MS4S16 Assessment 1 Prepared by Francis Afuwah. ID: 30074904¶

1. Introduction¶

This analysis examines the Wisconsin Breast Cancer dataset, an important medical resource for breast cancer detection and diagnosis. The collection, collected by Dr. William H. Wolberg, contains vital breast cancer tumor cell nuclei data. Exploring and understanding the dataset using sophisticated machine learning is the main goal. Data splitting, handling missing values, and outliers are pre-processing processes. We also do extensive Exploratory Data Analysis (EDA) including feature engineering and statistical analyses. We then use clustering methods and dimensionality reduction in unsupervised machine learning to reveal patterns. Following this, we use supervised machine learning for classification and regression. With ethical concerns in healthcare data analysis, our analysis is strengthened by informative visualizations, detailed assessments, and reflections on the methodology used.

Import libraries¶

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import IsolationForest
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import accuracy_score, silhouette_score
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.metrics import classification_report, confusion_matrix, mean_squared_error
from scipy.stats import shapiro, mannwhitneyu
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.decomposition import PCA
import warnings
warnings.filterwarnings("ignore")

2. Pre-processing and Exploratory Data Analysis¶

2.1 Load the dataset¶

In [2]:
data = pd.read_csv("MS4S16_Dataset.csv")

2.2 Data Splitting and Handling Outliers¶

In [3]:
# Splitting the dataset into Training Set and Test Set
train_data, test_data = train_test_split(data, test_size=0.2, random_state=42)
In [4]:
# checking null values
data.isnull().sum()
Out[4]:
id                          3
diagnosis                   3
radius_mean                 5
texture_mean                6
perimeter_mean              4
area_mean                   5
smoothness_mean             3
compactness_mean            4
concavity_mean              4
concave points_mean         8
symmetry_mean               3
fractal_dimension_mean      4
radius_se                   6
texture_se                  8
perimeter_se                3
area_se                     6
smoothness_se               6
compactness_se              7
concavity_se                8
concave points_se           9
symmetry_se                 8
fractal_dimension_se        7
radius_worst               13
texture_worst              21
perimeter_worst             6
area_worst                  4
smoothness_worst            9
compactness_worst           4
concavity_worst             3
concave points_worst        6
symmetry_worst              4
fractal_dimension_worst    13
dtype: int64
In [5]:
# Handle missing values
numeric_columns = train_data.select_dtypes(include=[np.number]).columns
imputer = SimpleImputer(strategy="mean")
train_data[numeric_columns] = imputer.fit_transform(train_data[numeric_columns])
test_data[numeric_columns] = imputer.transform(test_data[numeric_columns])
In [6]:
# Handle duplicated values
train_data = train_data.drop_duplicates(keep="first")
In [7]:
data.describe()
Out[7]:
id radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean symmetry_mean ... radius_worst texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst fractal_dimension_worst
count 5.680000e+02 566.000000 565.000000 567.000000 566.000000 568.000000 567.000000 567.000000 563.000000 568.000000 ... 558.000000 550.000000 565.000000 567.000000 562.000000 567.000000 568.000000 565.000000 567.000000 558.000000
mean 3.011402e+07 14.103267 -241.973664 91.949048 654.942403 0.096312 0.104333 0.088712 -3.500369 0.187402 ... 16.269794 25.735691 110.948035 897.936508 0.132469 0.254412 0.272125 0.114470 0.290327 0.084020
std 1.250894e+08 3.517424 445.216862 24.358029 352.555899 0.014178 0.052878 0.079739 59.492306 0.115008 ... 4.842370 6.123776 59.245691 688.231051 0.022865 0.157582 0.208867 0.065854 0.061907 0.018171
min 8.670000e+03 6.981000 -999.000000 43.790000 143.500000 0.052630 0.019380 0.000000 -999.000000 0.000700 ... 7.930000 12.020000 50.410000 185.200000 0.071170 0.027290 0.000000 0.000000 0.156500 0.055040
25% 8.690778e+05 11.692500 -999.000000 75.190000 420.300000 0.086290 0.064710 0.029520 0.019885 0.161900 ... 13.015000 21.222500 84.160000 515.550000 0.116850 0.147450 0.114475 0.064930 0.250450 0.071318
50% 9.060010e+05 13.320000 17.000000 86.240000 548.750000 0.095895 0.092630 0.061540 0.033340 0.179200 ... 14.965000 25.455000 97.820000 686.600000 0.131350 0.214100 0.227450 0.099930 0.282600 0.079960
75% 8.812852e+06 15.780000 21.010000 104.200000 787.050000 0.105325 0.130400 0.130000 0.073520 0.195700 ... 18.782500 29.705000 126.900000 1091.500000 0.146000 0.339500 0.383500 0.161300 0.318550 0.092088
max 9.113205e+08 28.110000 39.280000 188.500000 2501.000000 0.163400 0.345400 0.426800 0.201200 2.100000 ... 36.040000 49.540000 910.000000 10056.000000 0.222600 1.058000 1.252000 0.291000 0.663800 0.207500

8 rows × 31 columns

In [8]:
# Handle outliers
outlier_detector = IsolationForest(contamination=0.05)
train_data["outlier"] = outlier_detector.fit_predict(train_data.drop(["diagnosis", "id"], axis=1))
train_data = train_data[train_data["outlier"] != -1].drop("outlier", axis=1)
In [9]:
# Read the dataset from 'MS4S16_Dataset.csv'
data = pd.read_csv('MS4S16_Dataset.csv')

# Check data types of each column
for column in data.columns:
    if data[column].dtype == 'object':
        # If the column has string data, fill use the mode (most frequent value) to fill NaN values
        data[column] = data[column].fillna(data[column].mode()[0])
    else:
        # For numeric columns, replace NaN with mean
        data[column] = data[column].fillna(data[column].mean())

# Display the DataFrame after handling NaN values
print("DataFrame after handling NaN values:")
print(data)
DataFrame after handling NaN values:
             id diagnosis  radius_mean  texture_mean  perimeter_mean  \
0      842302.0         M        17.99         10.38          122.80   
1      842517.0         M        20.57         17.77          132.90   
2    84300903.0         M        19.69         21.25          130.00   
3    84348301.0         M        11.42         20.38           77.58   
4    84358402.0         M        20.29         14.34          135.10   
..          ...       ...          ...           ...             ...   
566    926682.0         M        20.13         28.25          131.20   
567    926954.0         M        16.60         28.08          108.30   
568    927241.0         M        20.60         29.33          140.10   
569     92751.0         B         7.76         24.54           47.92   
570     92751.0         B         7.76         24.54           47.92   

     area_mean  smoothness_mean  compactness_mean  concavity_mean  \
0       1001.0          0.11840           0.27760         0.30010   
1       1326.0          0.08474           0.07864         0.08690   
2       1203.0          0.10960           0.15990         0.19740   
3        386.1          0.14250           0.28390         0.24140   
4       1297.0          0.10030           0.13280         0.19800   
..         ...              ...               ...             ...   
566     1261.0          0.09780           0.10340         0.14400   
567      858.1          0.08455           0.10230         0.09251   
568     1265.0          0.11780           0.27700         0.35140   
569      181.0          0.05263           0.04362         0.00000   
570      181.0          0.05263           0.04362         0.00000   

     concave points_mean  ...  radius_worst  texture_worst  perimeter_worst  \
0                0.14710  ...        25.380          17.33           184.60   
1                0.07017  ...        24.990          23.41           158.80   
2                0.12790  ...        23.570          25.53           152.50   
3                0.10520  ...        14.910          26.50            98.87   
4                0.10430  ...        22.540          16.67           152.20   
..                   ...  ...           ...            ...              ...   
566              0.09791  ...        23.690          38.25           155.00   
567              0.05302  ...        18.980          34.12           126.70   
568              0.15200  ...        25.740          39.42           184.60   
569              0.00000  ...         9.456          30.37            59.16   
570              0.00000  ...         9.456          30.37            59.16   

     area_worst  smoothness_worst  compactness_worst  concavity_worst  \
0        2019.0           0.16220            0.66560           0.7119   
1        1956.0           0.12380            0.18660           0.2416   
2        1709.0           0.14440            0.42450           0.4504   
3         567.7           0.20980            0.86630           0.6869   
4        1575.0           0.13740            0.20500           0.4000   
..          ...               ...                ...              ...   
566      1731.0           0.11660            0.19220           0.3215   
567      1124.0           0.11390            0.30940           0.3403   
568      1821.0           0.16500            0.86810           0.9387   
569       268.6           0.08996            0.06444           0.0000   
570       268.6           0.08996            0.06444           0.0000   

     concave points_worst  symmetry_worst  fractal_dimension_worst  
0                  0.2654          0.4601                  0.11890  
1                  0.1860          0.2750                  0.08902  
2                  0.2430          0.3613                  0.08758  
3                  0.2575          0.6638                  0.17300  
4                  0.1625          0.2364                  0.07678  
..                    ...             ...                      ...  
566                0.1628          0.2572                  0.06637  
567                0.1418          0.2218                  0.07820  
568                0.2650          0.4087                  0.12400  
569                0.0000          0.2871                  0.07039  
570                0.0000          0.2871                  0.07039  

[571 rows x 32 columns]
In [10]:
data.describe().T
Out[10]:
count mean std min 25% 50% 75% max
id 571.0 3.011402e+07 1.247598e+08 8.670000e+03 869161.000000 906290.000000 8.836916e+06 9.113205e+08
radius_mean 571.0 1.410327e+01 3.501963e+00 6.981000e+00 11.705000 13.380000 1.576500e+01 2.811000e+01
texture_mean 571.0 -2.419737e+02 4.428674e+02 -9.990000e+02 -999.000000 16.950000 2.099500e+01 3.928000e+01
perimeter_mean 571.0 9.194905e+01 2.427241e+01 4.379000e+01 75.235000 86.490000 1.039500e+02 1.885000e+02
area_mean 571.0 6.549424e+02 3.510062e+02 1.435000e+02 420.400000 552.400000 7.826500e+02 2.501000e+03
smoothness_mean 571.0 9.631188e-02 1.414045e-02 5.263000e-02 0.086390 0.095940 1.053000e-01 1.634000e-01
compactness_mean 571.0 1.043333e-01 5.269249e-02 1.938000e-02 0.065090 0.094450 1.303500e-01 3.454000e-01
concavity_mean 571.0 8.871189e-02 7.945879e-02 0.000000e+00 0.029570 0.061810 1.282500e-01 4.268000e-01
concave points_mean 571.0 -3.500369e+00 5.907334e+01 -9.990000e+02 0.019420 0.032640 7.052500e-02 2.012000e-01
symmetry_mean 571.0 1.874016e-01 1.147049e-01 7.000000e-04 0.161900 0.179300 1.956500e-01 2.100000e+00
fractal_dimension_mean 571.0 -1.403337e+01 1.175208e+02 -9.990000e+02 0.057470 0.061400 6.600500e-02 9.744000e-02
radius_se 571.0 4.057996e-01 2.766729e-01 1.115000e-01 0.234100 0.326500 4.759500e-01 2.873000e+00
texture_se 571.0 1.220584e+00 5.485633e-01 3.602000e-01 0.846600 1.142000 1.472000e+00 4.885000e+00
perimeter_se 571.0 2.868202e+00 2.020423e+00 7.570000e-01 1.609000 2.289000 3.343500e+00 2.198000e+01
area_se 571.0 4.035017e+01 4.548431e+01 2.100000e+00 17.830000 24.620000 4.493500e+01 5.422000e+02
smoothness_se 571.0 7.026232e-03 2.982128e-03 1.713000e-03 0.005213 0.006399 8.077500e-03 3.113000e-02
compactness_se 571.0 2.534154e-02 1.772102e-02 2.252000e-03 0.013115 0.020520 3.206500e-02 1.354000e-01
concavity_se 571.0 3.189976e-02 3.012063e-02 0.000000e+00 0.015100 0.026110 4.161500e-02 3.960000e-01
concave points_se 571.0 1.178110e-02 6.164845e-03 0.000000e+00 0.007654 0.011090 1.464500e-02 5.279000e-02
symmetry_se 571.0 2.059438e-02 8.242557e-03 7.882000e-03 0.015200 0.018790 2.348500e-02 7.895000e-02
fractal_dimension_se 571.0 6.006402e-03 5.016246e-02 2.000000e-07 0.002253 0.003230 4.596500e-03 1.200000e+00
radius_worst 571.0 1.626979e+01 4.786831e+00 7.930000e+00 13.055000 15.050000 1.853000e+01 3.604000e+01
texture_worst 571.0 2.573569e+01 6.009911e+00 1.202000e+01 21.400000 25.590000 2.941000e+01 4.954000e+01
perimeter_worst 571.0 1.109480e+02 5.893305e+01 5.041000e+01 84.385000 98.270000 1.265000e+02 9.100000e+02
area_worst 571.0 8.979365e+02 6.858120e+02 1.852000e+02 515.850000 688.600000 1.086000e+03 1.005600e+04
smoothness_worst 571.0 1.324685e-01 2.268378e-02 7.117000e-02 0.117150 0.131600 1.458000e-01 2.226000e-01
compactness_worst 571.0 2.544117e-01 1.570279e-01 2.729000e-02 0.147750 0.215800 3.381000e-01 1.058000e+00
concavity_worst 571.0 2.721248e-01 2.083164e-01 0.000000e+00 0.115450 0.229800 3.819000e-01 1.252000e+00
concave points_worst 571.0 1.144705e-01 6.550680e-02 0.000000e+00 0.064985 0.101200 1.611000e-01 2.910000e-01
symmetry_worst 571.0 2.903266e-01 6.168968e-02 1.565000e-01 0.250550 0.282700 3.181500e-01 6.638000e-01
fractal_dimension_worst 571.0 8.401952e-02 1.796254e-02 5.504000e-02 0.071835 0.080200 9.195000e-02 2.075000e-01
In [11]:
data.isnull().sum()
Out[11]:
id                         0
diagnosis                  0
radius_mean                0
texture_mean               0
perimeter_mean             0
area_mean                  0
smoothness_mean            0
compactness_mean           0
concavity_mean             0
concave points_mean        0
symmetry_mean              0
fractal_dimension_mean     0
radius_se                  0
texture_se                 0
perimeter_se               0
area_se                    0
smoothness_se              0
compactness_se             0
concavity_se               0
concave points_se          0
symmetry_se                0
fractal_dimension_se       0
radius_worst               0
texture_worst              0
perimeter_worst            0
area_worst                 0
smoothness_worst           0
compactness_worst          0
concavity_worst            0
concave points_worst       0
symmetry_worst             0
fractal_dimension_worst    0
dtype: int64
In [12]:
# Encoding categorical features
label_encoder = LabelEncoder()
train_data["diagnosis"] = label_encoder.fit_transform(train_data["diagnosis"])
test_data["diagnosis"] = label_encoder.transform(test_data["diagnosis"])

2.3 Feature Engineering and Visualization¶

In [13]:
# Feature engineering
feature_selector = SelectKBest(f_classif, k=10)
selected_features = feature_selector.fit_transform(train_data.drop(["diagnosis", "id"], axis=1), train_data["diagnosis"])
In [14]:
data.hist(figsize = (20,20))
Out[14]:
array([[<Axes: title={'center': 'id'}>,
        <Axes: title={'center': 'radius_mean'}>,
        <Axes: title={'center': 'texture_mean'}>,
        <Axes: title={'center': 'perimeter_mean'}>,
        <Axes: title={'center': 'area_mean'}>,
        <Axes: title={'center': 'smoothness_mean'}>],
       [<Axes: title={'center': 'compactness_mean'}>,
        <Axes: title={'center': 'concavity_mean'}>,
        <Axes: title={'center': 'concave points_mean'}>,
        <Axes: title={'center': 'symmetry_mean'}>,
        <Axes: title={'center': 'fractal_dimension_mean'}>,
        <Axes: title={'center': 'radius_se'}>],
       [<Axes: title={'center': 'texture_se'}>,
        <Axes: title={'center': 'perimeter_se'}>,
        <Axes: title={'center': 'area_se'}>,
        <Axes: title={'center': 'smoothness_se'}>,
        <Axes: title={'center': 'compactness_se'}>,
        <Axes: title={'center': 'concavity_se'}>],
       [<Axes: title={'center': 'concave points_se'}>,
        <Axes: title={'center': 'symmetry_se'}>,
        <Axes: title={'center': 'fractal_dimension_se'}>,
        <Axes: title={'center': 'radius_worst'}>,
        <Axes: title={'center': 'texture_worst'}>,
        <Axes: title={'center': 'perimeter_worst'}>],
       [<Axes: title={'center': 'area_worst'}>,
        <Axes: title={'center': 'smoothness_worst'}>,
        <Axes: title={'center': 'compactness_worst'}>,
        <Axes: title={'center': 'concavity_worst'}>,
        <Axes: title={'center': 'concave points_worst'}>,
        <Axes: title={'center': 'symmetry_worst'}>],
       [<Axes: title={'center': 'fractal_dimension_worst'}>, <Axes: >,
        <Axes: >, <Axes: >, <Axes: >, <Axes: >]], dtype=object)
In [15]:
# Visualization and exploratory analysis
# Distribution of target variable
plt.figure(figsize=(8, 5))
sns.countplot(x="diagnosis", data=train_data)
plt.title("Distribution of Diagnosis (Malignant=1, Benign=0)")
plt.show()
In [16]:
# Calculating probability
data['diagnosis'].value_counts()/len(data)*100
Out[16]:
diagnosis
B    62.872154
M    37.127846
Name: count, dtype: float64
In [17]:
data['diagnosis'].value_counts().plot.pie(startangle=50, autopct='%1.1f%%')
plt.show()
In [18]:
# Correlation heatmap
# Assume variable 'diagnosis' is a non-numeric column excluded
plt.figure(figsize=(20, 15))
numeric_columns = data.select_dtypes(include=['number']).columns
numeric_data = data[numeric_columns]

# Heatmap
sns.heatmap(numeric_data.corr(), annot=True)
plt.show()
In [19]:
# Pairplot for selected features
selected_features_df = pd.DataFrame(selected_features, columns=train_data.columns[2:][feature_selector.get_support()])
selected_features_df["diagnosis"] = train_data["diagnosis"]

sns.pairplot(selected_features_df, hue="diagnosis", diag_kind="kde")
plt.suptitle("Pairplot for Selected Features", y=1.02)
plt.show()

2.4 Statistical Assumptions and Inferences¶

In [20]:
# Assessing statistical assumptions and inferences
# Shapiro-Wilk test for normality assumption
for feature in train_data.columns[2:]:
    stat, p_value = shapiro(train_data[feature])
    print(f"Shapiro-Wilk test for {feature}: p-value = {p_value}")
Shapiro-Wilk test for radius_mean: p-value = 1.7309971164780613e-11
Shapiro-Wilk test for texture_mean: p-value = 2.1936598324339096e-31
Shapiro-Wilk test for perimeter_mean: p-value = 6.742514099128405e-12
Shapiro-Wilk test for area_mean: p-value = 1.2496485520357056e-17
Shapiro-Wilk test for smoothness_mean: p-value = 0.529057502746582
Shapiro-Wilk test for compactness_mean: p-value = 6.072096384729386e-12
Shapiro-Wilk test for concavity_mean: p-value = 1.0943981992391801e-16
Shapiro-Wilk test for concave points_mean: p-value = 1.0337378771324176e-41
Shapiro-Wilk test for symmetry_mean: p-value = 1.109789147388254e-39
Shapiro-Wilk test for fractal_dimension_mean: p-value = 1.0439673559219887e-41
Shapiro-Wilk test for radius_se: p-value = 1.522855340464036e-20
Shapiro-Wilk test for texture_se: p-value = 9.386578320169647e-11
Shapiro-Wilk test for perimeter_se: p-value = 1.2664812109195955e-20
Shapiro-Wilk test for area_se: p-value = 1.6546661938016331e-25
Shapiro-Wilk test for smoothness_se: p-value = 2.5943174917049702e-17
Shapiro-Wilk test for compactness_se: p-value = 3.939458834222064e-18
Shapiro-Wilk test for concavity_se: p-value = 8.416581923317268e-17
Shapiro-Wilk test for concave points_se: p-value = 5.649330447887735e-10
Shapiro-Wilk test for symmetry_se: p-value = 2.79441463255114e-17
Shapiro-Wilk test for fractal_dimension_se: p-value = 8.895442651533939e-42
Shapiro-Wilk test for radius_worst: p-value = 7.0335532238946855e-15
Shapiro-Wilk test for texture_worst: p-value = 0.00019242154667153955
Shapiro-Wilk test for perimeter_worst: p-value = 3.814766843110038e-34
Shapiro-Wilk test for area_worst: p-value = 3.872731402230384e-30
Shapiro-Wilk test for smoothness_worst: p-value = 0.012931867502629757
Shapiro-Wilk test for compactness_worst: p-value = 5.12819030201229e-15
Shapiro-Wilk test for concavity_worst: p-value = 2.7120848427285293e-13
Shapiro-Wilk test for concave points_worst: p-value = 1.0418825802105403e-08
Shapiro-Wilk test for symmetry_worst: p-value = 6.156057109386881e-13
Shapiro-Wilk test for fractal_dimension_worst: p-value = 1.9552446921462015e-15
In [21]:
# Mann-Whitney U test for comparing distributions of benign and malignant cases
for feature in train_data.columns[2:]:
    stat, p_value = mannwhitneyu(train_data[train_data["diagnosis"]==0][feature],
                                 train_data[train_data["diagnosis"]==1][feature])
    print(f"Mann-Whitney U test for {feature}: p-value = {p_value}")
Mann-Whitney U test for radius_mean: p-value = 9.606217890757195e-49
Mann-Whitney U test for texture_mean: p-value = 5.297667926666895e-09
Mann-Whitney U test for perimeter_mean: p-value = 5.433245148759482e-51
Mann-Whitney U test for area_mean: p-value = 5.707545268513147e-49
Mann-Whitney U test for smoothness_mean: p-value = 3.896749542921967e-14
Mann-Whitney U test for compactness_mean: p-value = 9.842713126157597e-37
Mann-Whitney U test for concavity_mean: p-value = 9.038495525779925e-53
Mann-Whitney U test for concave points_mean: p-value = 1.2189396313054475e-55
Mann-Whitney U test for symmetry_mean: p-value = 4.760281492066189e-13
Mann-Whitney U test for fractal_dimension_mean: p-value = 0.7282519569708485
Mann-Whitney U test for radius_se: p-value = 5.606939812687955e-34
Mann-Whitney U test for texture_se: p-value = 0.5311141422265111
Mann-Whitney U test for perimeter_se: p-value = 6.903210913452115e-36
Mann-Whitney U test for area_se: p-value = 2.5797463849433016e-46
Mann-Whitney U test for smoothness_se: p-value = 0.14405996678951452
Mann-Whitney U test for compactness_se: p-value = 2.620399330772149e-15
Mann-Whitney U test for concavity_se: p-value = 1.236567113098441e-21
Mann-Whitney U test for concave points_se: p-value = 3.0073442359215313e-22
Mann-Whitney U test for symmetry_se: p-value = 0.02925017549776536
Mann-Whitney U test for fractal_dimension_se: p-value = 0.00010567945074825388
Mann-Whitney U test for radius_worst: p-value = 1.0763416655657176e-56
Mann-Whitney U test for texture_worst: p-value = 4.987782267252415e-21
Mann-Whitney U test for perimeter_worst: p-value = 9.973067727109487e-57
Mann-Whitney U test for area_worst: p-value = 2.15120622802156e-56
Mann-Whitney U test for smoothness_worst: p-value = 2.4934235102016866e-20
Mann-Whitney U test for compactness_worst: p-value = 1.6546556386962493e-37
Mann-Whitney U test for concavity_worst: p-value = 1.0073710431634533e-49
Mann-Whitney U test for concave points_worst: p-value = 9.314393496618401e-59
Mann-Whitney U test for symmetry_worst: p-value = 3.9936719114576526e-19
Mann-Whitney U test for fractal_dimension_worst: p-value = 9.596718603393983e-13

3. Unsupervised Machine Learning Analysis¶

3.1 K-Means¶

In [22]:
# Clustering using K-means
X_cluster = train_data.drop(["diagnosis", "id"], axis=1)
In [23]:
# Standardize the features
scaler = StandardScaler()
X_cluster_scaled = scaler.fit_transform(X_cluster)
In [24]:
# Determine the optimal number of clusters using the Elbow method
inertia = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, random_state=42)
    kmeans.fit(X_cluster_scaled)
    inertia.append(kmeans.inertia_)
In [25]:
# Plot the Elbow method
plt.figure(figsize=(8, 5))
plt.plot(range(1, 11), inertia, marker='o')
plt.title('Elbow Method for Optimal K')
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')
plt.show()
In [26]:
# Based on the Elbow method, let's choose K=2
kmeans = KMeans(n_clusters=2, random_state=42)
train_data['kmeans_cluster'] = kmeans.fit_predict(X_cluster_scaled)
In [27]:
# Evaluate the clustering using silhouette score
silhouette_avg = silhouette_score(X_cluster_scaled, train_data['kmeans_cluster'])
print(f"Silhouette Score for K-means: {silhouette_avg}")
Silhouette Score for K-means: 0.32166615571611196

3.2 PCA¶

In [28]:
# Dimensionality reduction using PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_cluster_scaled)
In [29]:
# Visualize the clusters in 2D using PCA
plt.figure(figsize=(10, 6))
sns.scatterplot(x=X_pca[:, 0], y=X_pca[:, 1], hue=train_data['kmeans_cluster'], palette='viridis', legend='full')
plt.title('Clustering Visualization using PCA')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()
In [30]:
# Clustering using Hierarchical (Agglomerative) Clustering
agg_clustering = AgglomerativeClustering(n_clusters=2)
train_data['agg_cluster'] = agg_clustering.fit_predict(X_cluster_scaled)
In [31]:
# Evaluate the clustering using silhouette score
silhouette_avg_agg = silhouette_score(X_cluster_scaled, train_data['agg_cluster'])
print(f"Silhouette Score for Agglomerative Clustering: {silhouette_avg_agg}")
Silhouette Score for Agglomerative Clustering: 0.30262020886873026

3.3 DBScan¶

In [32]:
# Clustering using DBSCAN
dbscan = DBSCAN(eps=1.5, min_samples=5)
train_data['dbscan_cluster'] = dbscan.fit_predict(X_cluster_scaled)
In [33]:
# Visualize the clusters in 2D using PCA for DBSCAN
plt.figure(figsize=(10, 6))
sns.scatterplot(x=X_pca[:, 0], y=X_pca[:, 1], hue=train_data['dbscan_cluster'], palette='viridis', legend='full')
plt.title('Clustering Visualization using PCA for DBSCAN')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()

4. Supervised Machine Learning Analysis¶

4.1 Classification¶

In [34]:
# Section 3: Supervised machine learning analysis

# Classification
X_classify = train_data.drop(["diagnosis", "id"], axis=1)
y_classify = train_data["diagnosis"]
In [35]:
# Split the data into training and testing sets
X_train_classify, X_test_classify, y_train_classify, y_test_classify = train_test_split(
    X_classify, y_classify, test_size=0.2, random_state=42
)
In [36]:
# Train a classification model (Random Forest)
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train_classify, y_train_classify)

# Predictions
y_pred_classify = clf.predict(X_test_classify)

# Evaluation for classification
accuracy_classify = clf.score(X_test_classify, y_test_classify)
print(f"Accuracy for Classification: {accuracy_classify}")

print("Classification Report:")
print(classification_report(y_test_classify, y_pred_classify))
Accuracy for Classification: 0.9770114942528736
Classification Report:
              precision    recall  f1-score   support

           0       0.97      1.00      0.98        58
           1       1.00      0.93      0.96        29

    accuracy                           0.98        87
   macro avg       0.98      0.97      0.97        87
weighted avg       0.98      0.98      0.98        87

In [37]:
# Confusion matrix
conf_matrix = confusion_matrix(y_test_classify, y_pred_classify)
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues")
plt.title("Confusion Matrix for Classification")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()

4.2 Regression¶

In [38]:
# Regression
# Choose a numerical feature to predict
feature_to_predict = "radius_mean"
X_reg = train_data.drop(["diagnosis", "id"], axis=1)  # Remove 'cluster' here
y_reg = train_data[feature_to_predict]

# Split the data into training and testing sets
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)

# Train a regression model (Random Forest)
regressor = RandomForestRegressor(random_state=42)
regressor.fit(X_train_reg, y_train_reg)

# Predictions
y_pred_reg = regressor.predict(X_test_reg)

# Evaluation for regression
mse_reg = mean_squared_error(y_test_reg, y_pred_reg)
print(f"Mean Squared Error for Regression: {mse_reg}")
Mean Squared Error for Regression: 0.03412270495902717
In [39]:
# Visualization for regression
plt.figure(figsize=(10, 6))
plt.scatter(X_test_reg.index, y_test_reg, label="Actual", alpha=0.7)
plt.scatter(X_test_reg.index, y_pred_reg, label="Predicted", alpha=0.7)
plt.title(f"Regression - Actual vs Predicted ({feature_to_predict})")
plt.xlabel("Sample Index")
plt.ylabel(feature_to_predict)
plt.legend()
plt.show()

5. Reflection¶

Reflecting on the Wisconsin Breast Cancer dataset analysis, the path has been illuminating and complex. During pre-processing, missing values, outliers, and feature engineering were handled carefully to maintain data integrity. Exploratory Data Analysis revealed complex correlations and patterns in the dataset. In unsupervised machine learning, clustering and dimensionality reduction revealed latent patterns in the data, helping identify important groups and attributes. The supervised learning phase increased complexity with classification and regression models predicting and classifying breast cancer instances. Throughout the analysis, model performance and ethical issues, particularly in healthcare, became more obvious. This approach has shown that data science is iterative, where insights from each phase feed future choices and improve comprehension of the information and its consequences.

6. Limitation¶

This analysis relies on the Wisconsin Breast Cancer dataset, which may not capture all breast cancer changes. As a static dataset from a given place and date, the model may not be generalizable to various populations. The dataset's size may also limit the model's capacity to detect subtleties in breast cancer trends. This limitation emphasizes the necessity to assess the dataset's representativeness and breadth before extending results to larger contexts or periods. Further research might use bigger, more varied datasets to improve machine learning models for breast cancer detection and diagnosis.

7. Ethical Implications¶

Machine learning models in healthcare, especially for breast cancer detection, bring ethical concerns that must be considered. Securely handling and protecting sensitive medical data is crucial to patient privacy and data confidentiality. Transparent communication and securing informed permission from dataset contributors are essential to upholding ethical values. Historical medical data biases like healthcare access and demographic representation should be addressed and managed to avoid repeating algorithmic projections of inequality. To maximise the technology's benefits and minimize damage, healthcare machine learning requires continual monitoring, multidisciplinary cooperation, and ethical compliance.

8. Conclusion and Recommendations¶

In conclusion, our extensive analysis of the Wisconsin Breast Cancer dataset has illuminated machine learning for breast cancer detection and diagnosis. Preprocessing and exploratory data analysis provided the groundwork for comprehending the dataset, while unsupervised machine learning revealed patterns. In the supervised learning phase, classification and regression tests showed the model's capacity to distinguish benign and malignant instances and predict key numerical characteristics. The analysis has prioritized patient privacy, openness, and bias reduction.

Moving forward, the analysis's limitations the dataset's fixed form and previous medical data biases must be acknowledged. Future research should use bigger and more varied datasets to improve model generalizability. To appropriately handle healthcare technology, healthcare practitioners, data scientists, and ethicists must collaborate. Machine learning is crucial to medical diagnosis, thus ethical norms, openness, and continual development are essential to maximize its promise in healthcare while reducing hazards.